Setup and Context¶

Introduction¶

Dr Ignaz Semmelweis was a Hungarian physician, born in 1818, who worked at the Vienna General Hospital. In the past, people attributed illness to "bad air" or evil spirits, but in the 1800s doctors increasingly turned to anatomy, performed autopsies, and began making arguments based on data. Dr Semmelweis suspected that something was going wrong with the procedures at Vienna General Hospital, and he wanted to figure out why so many women in the maternity wards were dying from childbed fever (i.e., puerperal fever).

Today you will become Dr Semmelweis. This is your office 👆. You will step into Dr Semmelweis' shoes and analyse the same data collected from 1841 to 1849.

The Data Source¶

Dr Semmelweis published his research in 1861. I found the scanned pages of the full text with the original tables in German, but an excellent English translation can be found here.

Upgrade plotly (Google Colab only)¶

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.

In [1]:
# %pip install --upgrade plotly
In [2]:
import plotly
plotly.__version__
Out[2]:
'5.10.0'

Import Statements¶

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

Notebook Presentation¶

In [4]:
pd.options.display.float_format = '{:,.2f}'.format

# Create locators for ticks on the time axis
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

Read the Data¶

In [5]:
df_yearly = pd.read_csv('annual_deaths_by_clinic.csv')
# parse_dates avoids DateTime conversion later
df_monthly = pd.read_csv('monthly_deaths.csv', 
                      parse_dates=['date'])

Preliminary Data Exploration¶

Challenge: Check out these two DataFrames ☝️.

  • What is the shape of df_yearly and df_monthly? How many rows and columns?
  • What are the column names?
  • Which years are included in the dataset?
  • Are there any NaN values or duplicates?
  • What was the average number of births per month?
  • What was the average number of deaths per month?
In [6]:
df_yearly
Out[6]:
year births deaths clinic
0 1841 3036 237 clinic 1
1 1842 3287 518 clinic 1
2 1843 3060 274 clinic 1
3 1844 3157 260 clinic 1
4 1845 3492 241 clinic 1
5 1846 4010 459 clinic 1
6 1841 2442 86 clinic 2
7 1842 2659 202 clinic 2
8 1843 2739 164 clinic 2
9 1844 2956 68 clinic 2
10 1845 3241 66 clinic 2
11 1846 3754 105 clinic 2
In [7]:
df_monthly
Out[7]:
date births deaths
0 1841-01-01 254 37
1 1841-02-01 239 18
2 1841-03-01 277 12
3 1841-04-01 255 4
4 1841-05-01 255 2
... ... ... ...
93 1848-11-01 310 9
94 1848-12-01 373 5
95 1849-01-01 403 9
96 1849-02-01 389 12
97 1849-03-01 406 20

98 rows × 3 columns

In [8]:
print(f"Our yearly data has {df_yearly.shape[1]} "
      f"columns and {df_yearly.shape[0]} rows.")
print(f"Our monthly data has {df_monthly.shape[1]} "
      f"columns and {df_monthly.shape[0]} rows.")
Our yearly data has 4 columns and 12 rows.
Our monthly data has 3 columns and 98 rows.

Check for NaN Values and Duplicates¶

In [9]:
df_yearly.isna().any().any()
Out[9]:
False
In [10]:
df_monthly.isna().any().any()
Out[10]:
False

We don't have any NaN values.

In [11]:
df_yearly.duplicated().any()
Out[11]:
False
In [12]:
df_monthly.duplicated().any()
Out[12]:
False

No duplicate data either.

Descriptive Statistics¶

In [13]:
df_yearly.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    12 non-null     int64 
 1   births  12 non-null     int64 
 2   deaths  12 non-null     int64 
 3   clinic  12 non-null     object
dtypes: int64(3), object(1)
memory usage: 512.0+ bytes
In [14]:
df_monthly.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    98 non-null     datetime64[ns]
 1   births  98 non-null     int64         
 2   deaths  98 non-null     int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 2.4 KB

The numbers and dates are already in the right data types 👌

In [15]:
df_yearly.describe()
Out[15]:
year births deaths
count 12.00 12.00 12.00
mean 1,843.50 3,152.75 223.33
std 1.78 449.08 145.38
min 1,841.00 2,442.00 66.00
25% 1,842.00 2,901.75 100.25
50% 1,843.50 3,108.50 219.50
75% 1,845.00 3,338.25 263.50
max 1,846.00 4,010.00 518.00
In [16]:
df_monthly.describe()
Out[16]:
births deaths
count 98.00 98.00
mean 267.00 22.47
std 41.77 18.14
min 190.00 0.00
25% 242.50 8.00
50% 264.00 16.50
75% 292.75 36.75
max 406.00 75.00

We have an average of 223 deaths and roughly 3,153 births per clinic per year, and an average of 22 deaths and 267 births per month.

Percentage of Women Dying in Childbirth¶

Challenge: How dangerous was childbirth in the 1840s in Vienna?

  • Using the annual data, calculate the percentage of women giving birth who died throughout the 1840s at the hospital.

In comparison, the United States recorded 18.5 maternal deaths per 100,000 births, or roughly 0.0185%, in 2013 (source).

In [17]:
death_year_pct = df_yearly.deaths.sum()/df_yearly.births.sum()
print(f"In the 1840s in Vienna, {death_year_pct*100:.0f}% "
      f"of women giving birth died.")
In the 1840s in Vienna, 7% of women giving birth died.

Visualise the Total Number of Births 🤱 and Deaths 💀 over Time¶

Plot the Monthly Data on Twin Axes¶

Challenge: Create a Matplotlib chart with twin y-axes. It should look something like this.

  • Format the x-axis using locators for the years and months (Hint: we did this in the Google Trends notebook)
  • Set the range on the x-axis so that the chart lines touch the y-axes
  • Add gridlines
  • Use skyblue and crimson for the line colours
  • Use a dashed line style for the number of deaths
  • Change the line thickness to 3 and 2 for the births and deaths respectively.
  • Do you notice anything in the late 1840s?
In [18]:
# Create locators for ticks on the time axis
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_formatter = mdates.DateFormatter('%Y')
In [19]:
plt.figure(figsize=(14,8), dpi=200)
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.title("Total Number of Monthly Births and Deaths", fontsize=18)
plt.xlabel('Time', fontsize=16)
plt.ylabel('Number of Births', fontsize=16)
plt.xlim(df_monthly.date.min(), df_monthly.date.max())

ax1 = plt.gca()
ax2 = plt.twinx()

ax1.plot('date', 'births', data=df_monthly, c='skyblue', linewidth=3)
ax2.plot('date', 'deaths', data=df_monthly, c='crimson', linewidth=2, 
         linestyle='--')
ax2.set_ylabel('Number of Deaths', fontsize=16)
plt.setp(ax2.get_yticklabels(), fontsize=14)

ax1.grid(color='grey', linestyle='--')

ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_formatter)
ax1.xaxis.set_minor_locator(months)

plt.show()

The Yearly Data Split by Clinic¶

Now let's look at the annual data instead.

Challenge: Use plotly to create line charts of the births and deaths of the two different clinics at the Vienna General Hospital.

  • Which clinic is bigger or more busy judging by the number of births?
  • Has the hospital had more patients over time?
  • What was the highest number of deaths recorded in clinic 1 and clinic 2?
In [20]:
fig = px.line(df_yearly, x='year', y='births', color='clinic',
              height=600, title='Total Yearly Births by Clinic',
              labels={'year': 'Time', 'births': 'Number of Births',
                      'deaths': 'Number of Deaths', 'clinic': 'Clinic'})
fig.show()
In [21]:
fig = px.line(df_yearly, x='year', y='deaths', color='clinic',
              height=600, title='Total Yearly Deaths by Clinic',
              labels={'year': 'Time', 'births': 'Number of Births',
                      'deaths': 'Number of Deaths', 'clinic': 'Clinic'})
fig.show()

Challenge: Calculate the proportion of maternal deaths per clinic. That way we can compare like with like.

  • Work out the percentage of deaths for each row in the df_yearly DataFrame by adding a column called "pct_deaths".
  • Calculate the average maternal death rate for clinic 1 and clinic 2 (i.e., the total number of deaths divided by the total number of births).
  • Create another plotly line chart to see how the percentage varies year over year with the two different clinics.
  • Which clinic has a higher proportion of deaths?
  • What is the highest yearly death rate in clinic 1 compared to clinic 2?
In [22]:
pct_deaths = df_yearly.deaths / df_yearly.births
df_yearly.insert(loc=3, column='pct_deaths', value=pct_deaths)
In [23]:
df_yearly.head(3)
Out[23]:
year births deaths pct_deaths clinic
0 1841 3036 237 0.08 clinic 1
1 1842 3287 518 0.16 clinic 1
2 1843 3060 274 0.09 clinic 1

Calculate the Proportion of Deaths at Each Clinic¶

In [24]:
total_deaths_clinic1 = df_yearly.deaths.loc[df_yearly.clinic=='clinic 1'].sum()
total_deaths_clinic2 = df_yearly.deaths.loc[df_yearly.clinic=='clinic 2'].sum()
In [25]:
total_births_clinic1 = df_yearly.births.loc[df_yearly.clinic=='clinic 1'].sum()
total_births_clinic2 = df_yearly.births.loc[df_yearly.clinic=='clinic 2'].sum()
In [26]:
avg_deaths_clinic1 = total_deaths_clinic1 / total_births_clinic1
avg_deaths_clinic2 = total_deaths_clinic2 / total_births_clinic2
In [27]:
display(avg_deaths_clinic1)
display(avg_deaths_clinic2)
0.09924159265542361
0.038839862852003824

We could also have simply averaged the percentage column.

In [28]:
display(df_yearly.pct_deaths.loc[df_yearly.clinic=='clinic 1'].mean())
display(df_yearly.pct_deaths.loc[df_yearly.clinic=='clinic 2'].mean())
0.0985052720217127
0.04039993689236435

Actually no, this doesn't work. I shouldn't average percentages here, because the sample size behind each row is different, e.g. one year has 4,000 births and another 3,000, so the rows are not equal: each row's percentage carries a different weight.
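In other words, the correct aggregate is a births-weighted average of the yearly percentages, which equals total deaths divided by total births. A minimal sketch with made-up numbers (not the clinic data):

```python
import numpy as np

# Hypothetical yearly data: births and deaths for two years
births = np.array([4000, 3000])
deaths = np.array([400, 150])
pct = deaths / births  # 0.10 and 0.05

naive_mean = pct.mean()                          # treats both years equally
weighted_mean = np.average(pct, weights=births)  # weights each year by its births
pooled = deaths.sum() / births.sum()             # total deaths / total births

# The births-weighted mean equals the pooled rate; the naive mean does not.
print(naive_mean, weighted_mean, pooled)
```

The weighted mean matches the pooled rate exactly, which is why summing first and dividing once is the safe approach.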

In [29]:
print(f"The average maternal death rate for "
      f"clinic 1 is {avg_deaths_clinic1*100:.1f}%")
print(f"The average maternal death rate for "
      f"clinic 2 is {avg_deaths_clinic2*100:.1f}%")
The average maternal death rate for clinic 1 is 9.9%
The average maternal death rate for clinic 2 is 3.9%

Plotting the Proportion of Yearly Deaths by Clinic¶

In [30]:
fig = px.line(df_yearly, x='year', y='pct_deaths', color='clinic',
              height=600, title='Yearly Maternal Death Rate by Clinic',
              labels={'year': 'Time', 'births': 'Number of Births',
                      'deaths': 'Number of Deaths', 'clinic': 'Clinic',
                      'pct_deaths': 'Maternal Death Rate'})
fig.show()

The story continues...

At first, Dr Semmelweis thought that the position of the women giving birth was the issue. In clinic 2, the midwives' clinic, women gave birth on their sides. In the doctors' clinic, women gave birth on their backs. So Dr Semmelweis had women in the doctors' clinic give birth on their sides. However, this had no effect on the death rate.

Next, Dr Semmelweis noticed that whenever someone on the ward died, a priest would walk through clinic 1, past the women's beds ringing a bell 🔔 . Perhaps the priest and the bell ringing terrified the women so much after birth that they developed a fever, got sick and died. Dr Semmelweis had the priest change his route and stop ringing the bell 🔕 . Again, this had no effect.

At this point, Dr Semmelweis was so frustrated he went on holiday to Venice. Perhaps a short break would clear his head. When Semmelweis returned from his vacation, he was told that one of his colleagues, a pathologist, had fallen ill and died. His friend had pricked his finger while doing an autopsy on a woman who had died from childbed fever and subsequently got very sick himself and died. 😮

The Effect of Handwashing¶

In June 1847, Dr Semmelweis ordered everyone on his medical staff to start cleaning their hands and instruments not just with soap and water but with a chlorine solution (he didn't know it at the time, but chlorine is an amazing disinfectant). The reason Dr Semmelweis actually chose the chlorine was that he wanted to get rid of any smell on doctors' hands after an autopsy. No one knew anything about bacteria, germs or viruses at the time.

In [31]:
# Date when handwashing was made mandatory
handwashing_start = pd.to_datetime('1847-06-01')

Challenge:

  • Add a column called "pct_deaths" to df_monthly that has the percentage of deaths per birth for each row.
  • Create two subsets from the df_monthly data: before and after Dr Semmelweis ordered handwashing.
  • Calculate the average death rate prior to June 1847.
  • Calculate the average death rate after June 1847.
In [32]:
df_monthly['pct_deaths'] = df_monthly.deaths/df_monthly.births
In [33]:
df_monthly.head(3)
Out[33]:
date births deaths pct_deaths
0 1841-01-01 254 37 0.15
1 1841-02-01 239 18 0.08
2 1841-03-01 277 12 0.04
In [34]:
df_monthly_before = df_monthly.loc[df_monthly.date < handwashing_start]
df_monthly_after = df_monthly.loc[df_monthly.date >= handwashing_start]
In [35]:
avg_death_prior = df_monthly_before.deaths.sum() / df_monthly_before.births.sum()
avg_death_after = df_monthly_after.deaths.sum() / df_monthly_after.births.sum()
In [36]:
print(f"The average death rate prior to June 1847 is {avg_death_prior*100:.2f}%")
print(f"The average death rate after June 1847 is {avg_death_after*100:.2f}%")
The average death rate prior to June 1847 is 10.53%
The average death rate after June 1847 is 2.15%

The death rate per birth dropped dramatically after handwashing started, from 10.53% to 2.15%. The colon and format spec inside an f-string let us control how many digits of a number are printed.
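For reference, the part after the colon in an f-string is a standard Python format specification. A few of the combinations used in this notebook:

```python
x = 0.1053

print(f"{x:.2f}")          # two decimal places -> 0.11
print(f"{x*100:.2f}%")     # as a percentage    -> 10.53%
print(f"{3152.75:,.2f}")   # thousands separator -> 3,152.75
```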

Calculate a Rolling Average of the Death Rate¶

Challenge: Create a DataFrame that has the 6 month rolling average death rate prior to mandatory handwashing.

Hint: You'll need to set the dates as the index in order to avoid the date column being dropped during the calculation.

In [37]:
df_monthly_rolling = df_monthly.set_index('date').rolling(window=6).mean()

Wait a minute, I think we're facing the same issue as before when averaging percentages...

For now, if I just display df_monthly_rolling.deaths / df_monthly_rolling.births and df_monthly_rolling.pct_deaths, they look similar though.
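To see why the two can differ, here is a toy sketch with made-up numbers (not the notebook's data): the rolling mean of the ratios weights every month equally, while the ratio of the rolling means effectively weights months by their number of births.

```python
import pandas as pd

# Hypothetical monthly data with very different sample sizes
toy = pd.DataFrame({'births': [100, 400], 'deaths': [10, 8]})
toy['pct_deaths'] = toy.deaths / toy.births   # 0.10 and 0.02

rolled = toy.rolling(window=2).mean()

mean_of_ratios = rolled.pct_deaths.iloc[-1]                # (0.10 + 0.02) / 2 = 0.06
ratio_of_means = (rolled.deaths / rolled.births).iloc[-1]  # 9 / 250 = 0.036

print(mean_of_ratios, ratio_of_means)  # the two disagree when sample sizes differ
```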

Personal Note 1: Return a DataFrame with an entire column replaced¶

If I want to replace an entire column and return the new DataFrame, the .replace(to_replace=, value=) method doesn't work. It only replaces actual values inside a DataFrame, e.g. replacing all ones with twos.

Instead, to replace an entire column, use .assign() with the column name as the keyword parameter and a scalar or a Series as the argument. See the examples below:

df_monthly_rolling.assign(pct_deaths=1)
df_monthly_rolling.assign(pct_deaths=df_monthly_rolling.deaths / df_monthly_rolling.births)

Personal Note 2: Comparing two Series with NaN values¶

  • In pandas, NaN is actually not equal to NaN, so NaN != NaN is True. To get NaN == NaN to behave as True, we can evaluate it like this: (a!=a)&(b!=b)

    In the end, to compare the values of two Series, where NaN are equal, I can use np.all((a == b) | (np.isnan(a) & np.isnan(b)))

test.pct_deaths == df_monthly_rolling.deaths / df_monthly_rolling.births
a = test.pct_deaths
b = df_monthly_rolling.deaths / df_monthly_rolling.births
(a == b) | (np.isnan(a) & np.isnan(b))  # Return the Series of True or False 
np.all((a == b) | (np.isnan(a) & np.isnan(b))) # Return a unique True or False
  • Or else, to make better use of memory, I can import NumExpr which "is a fast numerical expression evaluator for NumPy. [...] expressions that operate on arrays are accelerated and use less memory than doing the same calculation in Python."
import numexpr

pd.Series(numexpr.evaluate('(a==b)|((a!=a)&(b!=b))')) # Return the Series of True or False 
np.all(numexpr.evaluate('(a==b)|((a!=a)&(b!=b))')) # Return a unique True or False
  • Alternatively, I can use the .equals() Pandas method.
a==b  # Return a Series, where True if values are equal, but False for NaN
(a==b).all()  # Return a unique boolean, which is False in this case because of the NaNs
a.equals(b) # Return a unique boolean too, but True in this case, considering NaN equal
  • Finally, the .compare() method returns a DataFrame with the columns self and other to show the difference between the values when they're not equal, e.g. a.compare(b)

Personal Note 3: Regarding display of large tables¶

In [38]:
pd.options.display.max_rows
Out[38]:
60

Using the command pd.describe_option('display') will give us a bunch of information about the current display options, such as:

  • display.large_repr : 'truncate'/'info'

    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can show a truncated table (the default from 0.13), or switch to the view from df.info() (the behaviour in earlier versions of pandas).

    [default: truncate] [currently: truncate]      
  • display.max_rows : int

    If max_rows is exceeded, switch to truncate view. Depending on large_repr, objects are either centrally truncated or printed as a summary view. 'None' value means unlimited. [...]

    [default: 60] [currently: 60]
  • display.max_columns : int

    [...]

  • display.min_rows : int

    The numbers of rows to show in a truncated view (when max_rows is exceeded). [...]

We can change these options, like the max number of rows before truncating: pd.set_option('display.max_rows', 500)

But we can also change the option temporarily:

from IPython.display import display
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    display(df) #need display to show the dataframe when using with in jupyter
    #some pandas stuff
In [39]:
from IPython.display import display

a = df_monthly_rolling.pct_deaths
b = df_monthly_rolling.deaths / df_monthly_rolling.births
In [40]:
a.equals(b)
Out[40]:
False

It seems that indeed, the rolling average of the percentages and the ratio of the rolling averages are not the same. If we want to display the Boolean Series resulting from the comparison, we don't get the whole thing by default, so we can temporarily change the display options like so (the df has 98 rows):

with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    display((a == b) | (np.isnan(a) & np.isnan(b)))

And we see that actually only the NaN values are the same (True), all the other values are different (False).

In [41]:
with pd.option_context('display.float_format', '{:,.5f}'.format,
                       'display.min_rows', 6):
    display(a.compare(b))
self other
date
1841-06-01 0.05631 0.05608
1841-07-01 0.04606 0.04379
1841-08-01 0.03576 0.03360
... ... ...
1849-01-01 0.01630 0.01685
1849-02-01 0.02144 0.02157
1849-03-01 0.02805 0.02844

93 rows × 2 columns

By changing the way the floats are displayed, we can see that, although very close, the values are indeed different.

So, we need to reassign the right values.

In [42]:
df_monthly_rolling.pct_deaths = df_monthly_rolling.deaths / df_monthly_rolling.births
In [43]:
df_monthly_before_rolling = df_monthly_before.set_index('date')\
                            .rolling(window=6).mean()
df_monthly_before_rolling.pct_deaths =\
    df_monthly_before_rolling.deaths.divide(df_monthly_before_rolling.births)

Highlighting Subsections of a Line Chart¶

Challenge: Copy-paste and then modify the Matplotlib chart from before to plot the monthly death rates (instead of the total number of births and deaths). The chart should look something like this:

  • Add 3 separate lines to the plot: the death rate before handwashing, after handwashing, and the 6-month moving average before handwashing.
  • Show the monthly death rate before handwashing as a thin dashed black line.
  • Show the moving average as a thicker, crimson line.
  • Show the rate after handwashing as a skyblue line with round markers.
  • Look at the code snippet in the documentation to see how you can add a legend to the chart.

Bringing the date back to a column instead of an index works better for the x and y data in the graphs further below:

In [44]:
df_monthly_before_rolling.reset_index(inplace=True) # not "drop=True" here!
In [45]:
plt.figure(figsize=(14,8), dpi=200)
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.title("Evolution of Maternal Death Rate", fontsize=18)
plt.xlabel('Time', fontsize=16)
plt.ylabel('Rate of Maternal Death', fontsize=16)
plt.xlim(df_monthly.date.min(), df_monthly.date.max())

plt.plot('date', 'pct_deaths', data=df_monthly_before, 
         c='black', linewidth=1, linestyle='--')
plt.plot('date', 'pct_deaths', data=df_monthly_before_rolling, 
         c='crimson', linewidth=3, linestyle='--')
plt.plot('date', 'pct_deaths', data=df_monthly_after, 
         c='skyblue', linewidth=3, marker='o')

ax = plt.gca()
ax.grid(color='grey', linestyle='--')
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(years_formatter)
ax.xaxis.set_minor_locator(months)

plt.legend(['Before Handwashing', '6-Month Rolling Average', 
            'After Handwashing'], fontsize=16)
plt.show()

Using the lines returned by .plot()¶

It is worth knowing that plt.plot() actually returns a list of Line2D objects (i.e., the lines representing the plotted data).

We can then reuse them in some functions, like .legend(), e.g.

ma_line, = plt.plot(...)
bw_line, = plt.plot(...)
aw_line, = plt.plot(...) 
plt.legend(handles=[ma_line, bw_line, aw_line],
           fontsize=18)

Statistics - Calculate the Difference in the Average Monthly Death Rate¶

Challenge:

  • What was the average percentage of monthly deaths before handwashing?
  • What was the average percentage of monthly deaths after handwashing was made obligatory?
  • By how much did handwashing reduce the average chance of dying in childbirth in percentage terms?
  • How do these numbers compare to the average for all the 1840s that we calculated earlier?
  • How many times lower are the chances of dying after handwashing compared to before?
df_monthly_before.describe()
df_monthly_after.describe()

Note: This time, we do ask for an average of the monthly death rate, so an average of averages.

In [46]:
avg_death_monthly_before = df_monthly_before.pct_deaths.mean()
avg_death_monthly_after = df_monthly_after.pct_deaths.mean()
print(f"The average percentage of monthly deaths before handwashing "
      f"was {avg_death_monthly_before*100:.2f}%")
print(f"The average percentage of monthly deaths after handwashing "
      f"was {avg_death_monthly_after*100:.2f}%")
The average percentage of monthly deaths before handwashing was 10.50%
The average percentage of monthly deaths after handwashing was 2.11%
In [47]:
print(f"The chance of dying in childbirth dropped \
{(avg_death_monthly_before-avg_death_monthly_after)*100:.2f}%")
The chance of dying in childbirth dropped 8.40%

The previous death rate was avg_death_prior and the new rate avg_death_after, so the chance of dying was reduced by:

In [48]:
print(f"a factor {avg_death_prior/avg_death_after:.1f}!")
a factor 4.9!

A lot of statistical tests rely on comparing features of distributions, like the mean. We see that the average death rate before handwashing was 10.50%. After handwashing was made obligatory, the average death rate was 2.11%. The difference is massive: handwashing decreased the average death rate by 8.4 percentage points, roughly a 5x improvement.

Note: be careful with wording: "is 8.4% less" does NOT mean the same as "reduced by 8.4 percentage points".
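A quick numeric illustration of that distinction, using rounded versions of the rates above:

```python
before, after = 0.105, 0.021  # approximate death rates before/after handwashing

# Absolute change, in percentage points
drop_points = (before - after) * 100             # 8.4 percentage points

# Relative change, as a percent of the original rate
drop_relative = (before - after) / before * 100  # 80% lower than before

print(f"{drop_points:.1f} percentage points, {drop_relative:.0f}% relative reduction")
```

An 8.4 percentage-point drop from a 10.5% baseline is an 80% relative reduction, which is why the same change can be described with two very different-sounding numbers.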

Use Box Plots to Show How the Death Rate Changed Before and After Handwashing¶

The statistic above is impressive, but how do we show it graphically? With a box plot we can show how the quartiles, minimum, and maximum values changed in addition to the mean.

Challenge:

  • Use NumPy's .where() function to add a column to df_monthly that shows if a particular date was before or after the start of handwashing.
  • Then use plotly to create a box plot of the data before and after handwashing.
  • How did key statistics like the mean, max, min, 1st and 3rd quartile change as a result of the new policy?

Example with NumPy .where()¶

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0,  1,  2,  3,  4, 50, 60, 70, 80, 90])
In [49]:
df_monthly['washing_hands'] = np.where(df_monthly.date < handwashing_start, 'No', 'Yes')
df_monthly.sample(2)
Out[49]:
date births deaths pct_deaths washing_hands
87 1848-05-01 313 3 0.01 Yes
63 1846-05-01 305 41 0.13 No
In [50]:
fig = px.box(df_monthly, x='washing_hands', y='pct_deaths', 
             color='washing_hands', height=600,
             labels={'washing_hands': 'Washing Hands?',
                     'pct_deaths': 'Rate of Monthly Maternal Death'},
             title='How Have the Stats Changed with Handwashing?')
fig.show()

The plot shows us the same data as our Matplotlib chart, but from a different perspective. Here we also see the massive spike in deaths in late 1842. Over 30% of women who gave birth that month died in hospital. What we also see in the box plot is how not only did the average death rate come down, but so did the overall range - we have a lower max and 3rd quartile too. Let's take a look at a histogram to get a better sense of the distribution.

Use Histograms to Visualise the Monthly Distribution of Outcomes¶

Challenge: Create a plotly histogram to show the monthly percentage of deaths.

  • Use docs to check out the available parameters. Use the color parameter to display two overlapping histograms.
  • The time period of handwashing is shorter than not handwashing. Change histnorm to percent to make the time periods comparable.
  • Make the histograms slightly transparent.
  • Experiment with the number of bins on the histogram. Which number works well in communicating the range of outcomes?
  • Just for fun, display your box plot on the top of the histogram using the marginal parameter.

Library to find divisors of a number, and other things¶

SymPy is a really powerful math library, see examples here.

We can use it to find all the divisors of a number for instance (which is actually a very complex task).

In [51]:
from sympy import divisors, divisor_count
In [52]:
divisor_count(98)
Out[52]:
$\displaystyle 6$
In [53]:
divisors(98)
Out[53]:
[1, 2, 7, 14, 49, 98]

Using new parameters in Plotly, with histogram: color=, opacity=, barmode=, histfunc=, histnorm=

In [54]:
fig = px.histogram(df_monthly, x='pct_deaths', histfunc='count',
                   nbins=30, color='washing_hands', opacity=0.6,
                   barmode='overlay')
# using color divides the histogram in two
fig.update_layout(xaxis_title='Rate of Monthly Maternal Deaths',
                  yaxis_title='Number of Months')
fig.show()

However, the sample of data for "No Washing Hands" is bigger, so we can't really compare the two histograms by raw counts. We'll change histnorm to 'percent' instead, so that each bar reads as, e.g., "40% of the months had a death rate in this range".

In [55]:
fig = px.histogram(df_monthly, x='pct_deaths', histfunc='count',
                   nbins=30, color='washing_hands', opacity=0.6,
                   barmode='overlay', histnorm='percent')
fig.update_layout(xaxis_title='Rate of Monthly Maternal Deaths',
                  yaxis_title='% of months')
fig.show()

This tells much the same story as the box plot: the data is concentrated at a low death rate for "Yes Washing Hands", and is much more spread out, with a higher average, for "No".

We can actually display both of them together, using the marginal= parameter.

In [56]:
fig = px.histogram(df_monthly, x='pct_deaths', histfunc='count',
                   nbins=30, color='washing_hands', opacity=0.6,
                   barmode='overlay', histnorm='percent',
                   marginal='box', height=500)
fig.update_layout(xaxis_title='Rate of Monthly Maternal Deaths',
                  yaxis_title='Number of Months')
fig.show()

Now, we only have 98 data points, so our histogram looks a bit jagged. It's not a smooth bell-shaped curve. However, we can estimate what the distribution would look like with a Kernel Density Estimate (KDE).

Use a Kernel Density Estimate (KDE) to visualise a smooth distribution¶

Challenge: Use Seaborn's .kdeplot() to create two kernel density estimates of the pct_deaths, one for before handwashing and one for after.

  • Use the shade parameter to give your two distributions different colours.
  • What weakness in the chart do you see when you just use the default parameters?
  • Use the clip parameter to address the problem.
In [57]:
plt.figure(figsize=(10,6), dpi=120)
kde = sns.kdeplot(data=df_monthly, x='pct_deaths', hue='washing_hands',
                  clip=(0,1), common_norm=False,
                  fill=True)
kde.set(xlim=(0, 0.4), xlabel='Rate of Monthly Maternal Deaths',
        ylabel='Density Distribution')
kde.legend_.set_title('Washing Hands?')
plt.title('Estimated Distribution of Monthly Maternal Death Rate')
plt.show()

The shade parameter is deprecated; it is called fill now. It simply fills the area under the curve with a colour.

Initially, the KDE goes into the negative on the x-axis, so we use the clip=(0,1) to "not evaluate the density outside of these limits".

Finally, we use common_norm=False to "normalize each density independently", i.e. the area under each curve sums to 1. If True, it is the "total area under all densities" that "sums to 1", i.e. they are normalised together.

Now that we have an idea of what the two distributions look like, we can further strengthen our argument for handwashing by using a statistical test. We can test whether our distributions ended up looking so different purely by chance (i.e., the lower death rate is just an accident) or if the 8.4% difference in the average death rate is statistically significant.

Use a T-Test to Show Statistical Significance¶

Challenge: Use a t-test to determine if the differences in the means are statistically significant or purely due to chance.

If the p-value is less than 1% then we can be 99% certain that handwashing has made a difference to the average monthly death rate.

  • Import stats from scipy
  • Use the .ttest_ind() function to calculate the t-statistic and the p-value
  • Is the difference in the average proportion of monthly deaths statistically significant at the 99% level?

What Is a T-Test?

A t-test is an inferential statistic used to determine whether there is a significant difference between the means of two groups. T-tests are used when the data sets are approximately normally distributed and have unknown variances (for example, the number of heads recorded over repeated runs of 100 coin flips is approximately normally distributed).

The t-test is a test used for hypothesis testing in statistics and uses the t-statistic, the t-distribution values, and the degrees of freedom to determine statistical significance.

T-Score

A large t-score, or t-value, indicates that the groups are different while a small t-score indicates that the groups are similar.

P-value

Every t-value has a p-value to go with it. A p-value from a t-test is the probability of obtaining results at least as extreme as those in your sample purely by chance. P-values range from 0 to 1 and are usually written as a decimal (for example, a p-value of 5% is 0.05). Low p-values indicate your data is unlikely to have occurred by chance; for example, a p-value of 0.01 means there is only a 1% probability that results this extreme happened by chance.
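To demystify where the t-statistic comes from, here is a sketch that computes the standard pooled-variance two-sample t by hand on made-up samples and checks it against SciPy, which is what .ttest_ind() computes by default (equal_var=True). The sample sizes and distributions below are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0.10, 0.02, size=30)  # hypothetical "before" rates
b = rng.normal(0.02, 0.01, size=20)  # hypothetical "after" rates

# Pooled-variance two-sample t-statistic:
# t = (mean_a - mean_b) / sqrt(s_p^2 * (1/n1 + 1/n2))
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

t_scipy, p = stats.ttest_ind(a, b)  # equal_var=True by default
print(t_manual, t_scipy)            # the two agree
```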

In [58]:
from scipy import stats
In [59]:
t_test = stats.ttest_ind(a=df_monthly.pct_deaths.loc[df_monthly.washing_hands == 'No'],
                         b=df_monthly.pct_deaths.loc[df_monthly.washing_hands == 'Yes'])
t_test
Out[59]:
Ttest_indResult(statistic=5.511607211341916, pvalue=2.985436556724523e-07)
In [60]:
print(f"{t_test.pvalue:.10f}")
0.0000002985

The p-value is incredibly small! There is no way in the world the results occurred by chance.

We could also use the ttest_ind like so:

In [61]:
t_stat, p_value = stats.ttest_ind(a=df_monthly_before.pct_deaths, 
                                  b=df_monthly_after.pct_deaths)
print(f'p-value is {p_value:.10f}')
print(f't-statistic is {t_stat:.4}')
p-value is 0.0000002985
t-statistic is 5.512

When we calculate the p_value we see that it is 0.0000002985, or 0.00002985%, which is far below even 1%. In other words, the difference in means is highly statistically significant and we can go ahead and publish our research paper.

What do you conclude from your analysis, Doctor? 😊